
Introduction

This vignette demonstrates how to use the functions in the danubeoccurR package to download and clean fish species occurrence records from the Global Biodiversity Information Facility (GBIF) for the Danube River Basin. It covers the steps to download, clean, and filter the data to prepare it for further analysis.

Before starting, ensure that you have:

  • An active internet connection to access GBIF data.
  • A GBIF account (username, password, and email) for requesting occurrence downloads.

Additionally, ensure that you follow the installation guidelines of the hydrographr package for your operating system.

Step 1: Install the Package from GitHub

To get started, install the danubeoccurR, hydrographr, and specleanr packages from GitHub. You can do this using the devtools package:

# Install the devtools package if you haven't already
install.packages("devtools")

# Install the danubeoccurR package from GitHub
devtools::install_github("ytorres-cambas/danubeoccurR")
devtools::install_github("glowabio/hydrographr")
devtools::install_github("AnthonyBasooma/specleanr")
library(danubeoccurR)
library(hydrographr)
library(specleanr)
library(dplyr)
library(sf)

Step 2: Downloading GBIF Data

After installing the package, you can download fish species occurrences using the download_gbif_records() function. This will fetch records that fall within the geographic boundary of the Danube River Basin.

# Define the species to download. We will use only two species from a
# checklist of fish species of the Danube River Basin. The complete checklist
# is a data set included with the danubeoccurR package.
species_list <- species_checklist[1:2]
print(species_list)
#> [1] "Abramis brama"    "Acipenser baerii"
# Define a geographic bounding box to use for querying GBIF occurrences within
# the specified area. The bounding box should be a simple feature (sf) object, 
# typically imported into R with sf::read_sf().
# In this example, we use an sf object that is included in the danubeoccurR 
# package, so there is no need to import it manually.
bbox_danube_basin
#> Simple feature collection with 1 feature and 1 field
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 8 ymin: 42 xmax: 30 ymax: 51
#> Geodetic CRS:  WGS 84
#>   gid                           geom
#> 1   1 POLYGON ((8 51, 30 51, 30 4...
# Convert the polygon to WKT format using the `get_wkt` function. 
# This format is required for GBIF. The function takes an sf polygon object, 
# ensures that the polygon has a counterclockwise winding order, 
# and returns the WKT (Well-Known Text) representation of the polygon.
wkt <- get_wkt(bbox_danube_basin)
print(wkt)
#> [1] "POLYGON ((8 51, 8 42, 30 42, 30 51, 8 51))"
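For reference, the same counterclockwise ring can be assembled by hand from the bounding-box coordinates. This base-R sketch is independent of get_wkt() and only illustrates the vertex order that GBIF expects:

```r
# Bounding-box coordinates of the Danube basin extent (from the sf object above)
xmin <- 8L; ymin <- 42L; xmax <- 30L; ymax <- 51L

# Counterclockwise ring: start at the upper-left corner, go down, right, up,
# and close the ring by repeating the first vertex (required by WKT)
wkt_manual <- sprintf(
  "POLYGON ((%d %d, %d %d, %d %d, %d %d, %d %d))",
  xmin, ymax, xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax
)
print(wkt_manual)
#> [1] "POLYGON ((8 51, 8 42, 30 42, 30 51, 8 51))"
```

The result is identical to the string returned by get_wkt() above.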

The download_gbif_records() function retrieves fish species occurrences from GBIF based on the species list and the polygon that represents the Danube River Basin. We set import_to_r = TRUE to import the occurrences into R. The function returns a list containing a data frame of occurrences and the information needed to cite the download.

# Get the path to the temporary directory to save occurrences
temp_path <- tempdir() 

# Download occurrences from GBIF
gbif_data <- download_gbif_records(species_names = species_list,
                                   wkt = wkt,
                                   output_occur_path = temp_path,
                                   gbif_user = "your_username",
                                   gbif_pwd = "your_password",
                                   gbif_email="your_email@example.com",
                                   import_to_r = TRUE)
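To avoid hard-coding credentials in scripts, they can be read from environment variables before the call above. This is a generic base-R sketch; the variable names GBIF_USER, GBIF_PWD, and GBIF_EMAIL are our own convention (e.g. set in ~/.Renviron), not part of danubeoccurR:

```r
# Read GBIF credentials from environment variables so they never appear in
# scripts (the variable names are our own convention, not part of danubeoccurR)
get_gbif_credentials <- function() {
  creds <- Sys.getenv(c("GBIF_USER", "GBIF_PWD", "GBIF_EMAIL"))
  # Fail early with a clear message if any credential is missing
  if (any(!nzchar(creds))) {
    stop("Set GBIF_USER, GBIF_PWD and GBIF_EMAIL in your .Renviron file.")
  }
  as.list(creds)
}
```

The resulting values can then be passed to download_gbif_records() via the gbif_user, gbif_pwd, and gbif_email arguments.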

Step 3: Cleaning GBIF Data

Once the occurrences are downloaded, we clean the data using clean_gbif(). This function removes duplicates, invalid coordinates, and records with other quality issues, ensuring that the data is ready for analysis without missing or erroneous records.

# Clean the downloaded GBIF records
gbif_cleaned <- clean_gbif(gbif_data$raw_download,
                           coordinateUncertaintyInMeters = 1000,
                           coordinatePrecision = 0.01,
                           buffer_centroid_zoo_herbaria = 1000,
                           buffer_centroid_capitals = 1000,
                           buffer_centroid_countries = 1000)
#> Testing country centroids: Removed 4 records.
#> Testing country capitals: Removed 1 records.
#> Testing biodiversity institutions: Removed 27 records.
#> Testing missing year or species: Removed 271 records.
#> Retained 2149 out of 2453 records after cleaning.

As we can see, 2149 out of 2453 records were retained after the cleaning procedure.

# Number of records before cleaning
nrow(gbif_data$raw_download)
#> [1] 2453

# Number of records after cleaning
nrow(gbif_cleaned)
#> [1] 2149

# Preview the cleaned data
head(gbif_cleaned)
#> # A tibble: 6 × 50
#>      gbifID datasetKey      occurrenceID kingdom phylum class order family genus
#>     <int64> <chr>           <chr>        <chr>   <chr>  <lgl> <chr> <chr>  <chr>
#> 1 4.55e-315 6ac3f774-d9fb-… ""           Animal… Chord… NA    Cypr… Cypri… Abra…
#> 2 4.55e-315 6ac3f774-d9fb-… ""           Animal… Chord… NA    Cypr… Cypri… Abra…
#> 3 4.55e-315 6ac3f774-d9fb-… ""           Animal… Chord… NA    Cypr… Cypri… Abra…
#> 4 4.55e-315 6ac3f774-d9fb-… ""           Animal… Chord… NA    Cypr… Cypri… Abra…
#> 5 4.55e-315 6ac3f774-d9fb-… ""           Animal… Chord… NA    Cypr… Cypri… Abra…
#> 6 4.55e-315 6ac3f774-d9fb-… ""           Animal… Chord… NA    Cypr… Cypri… Abra…
#> # ℹ 41 more variables: species <chr>, infraspecificEpithet <chr>,
#> #   taxonRank <chr>, scientificName <chr>, verbatimScientificName <chr>,
#> #   verbatimScientificNameAuthorship <chr>, countryCode <chr>, locality <chr>,
#> #   stateProvince <chr>, occurrenceStatus <chr>, individualCount <int>,
#> #   publishingOrgKey <chr>, decimalLatitude <dbl>, decimalLongitude <dbl>,
#> #   coordinateUncertaintyInMeters <dbl>, coordinatePrecision <dbl>,
#> #   elevation <dbl>, elevationAccuracy <dbl>, depth <lgl>, …

Step 4: Visualizing the Occurrences

We can now visualize the occurrences on a map using visualize_points().

visualize_points(points_df = gbif_cleaned,
                 layer = danube_basin)

As shown, some occurrence records fall outside the Danube River Basin. This happens because we used a bounding box as the spatial reference when querying GBIF. To limit the data strictly to the Danube River Basin, we can apply spatial filtering using the get_spatial_subset() function. This function filters the points based on a layer representing the Danube River Basin, which is included as one of the datasets in the danubeoccurR package.

Step 5: Spatial Filtering of Occurrence Records

# Spatial filtering of occurrence records
gbif_filtered <- get_spatial_subset(danube_basin,
                                    gbif_cleaned,
                                    "decimalLatitude",
                                    "decimalLongitude",
                                     verbose = TRUE)

Now, the map displays only the occurrence records that are strictly located within the Danube River Basin.

visualize_points(points_df = gbif_filtered,
                 layer = danube_basin)

Step 6: Checking species names

To clean the species names and ensure consistency with FishBase, we will use the check_names() function from the specleanr package. This function matches species names to FishBase and, by default, returns the accepted name if a synonym is provided. It also checks for spelling errors and suggests the closest match in FishBase based on the similarity threshold defined by the pct parameter. We will use the default pct value of 90, which requires a match of 90% or above.

# Check species names for inconsistencies 
gbif_name_checked <- check_names(data = gbif_filtered,
                                 colsp = "species",
                                 verbose = FALSE,
                                 pct = 90,
                                 merge = FALSE,
                                 sn = FALSE)
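To see what the pct similarity threshold means in practice, here is a toy fuzzy match built on base R's adist(); this is only an illustration of the idea, not specleanr's actual matching code:

```r
# Toy fuzzy match against a reference list, illustrating a 90% similarity
# cutoff (this is not specleanr's actual matching code)
reference <- c("Abramis brama", "Acipenser baerii")
query <- "Abramis bramma"   # misspelled name

# Similarity = 1 - edit distance / length of the longer string
d <- drop(adist(query, reference))
similarity <- 1 - d / pmax(nchar(query), nchar(reference))
best_match <- reference[which.max(similarity)]

# Accept the match only if it reaches the 90% threshold
if (max(similarity) >= 0.90) best_match else NA_character_
#> [1] "Abramis brama"
```

Here the misspelling differs from "Abramis brama" by a single edit, so the similarity (about 93%) exceeds the threshold and the match is accepted.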

# Preview the species names
print(gbif_name_checked)

As you can see, the species names in the species column, which correspond to the species names in our dataset, match the names in the speciescheck column from FishBase, indicating that no changes are required.

Step 7: Add additional taxonomic information

If taxonomic information for Genus, Family, or Order is missing from some occurrences, it can be retrieved from FishBase and added. While the example dataset does not have missing taxonomic information, its classification differs from that used in FishBase. Therefore, we will update the taxonomic classification in the example dataset based on the information retrieved from FishBase. We will use the species name list obtained in the previous step after verifying species names against FishBase.

# Return genus, family and order of the species in the input vector. 
tax_info <- get_taxonomy_info(species = gbif_name_checked$speciescheck, sources = "fishbase")

# Visualize result
print(tax_info)

# Visualize taxonomic classification used in the example dataset.
gbif_filtered %>%
  select(species, genus, family, order) %>%
  distinct(species, .keep_all = TRUE) %>%
  print()

The result is a data frame containing columns for Species, Genus, Family, and Order. In this case, the species Abramis brama is classified under the Family Leuciscidae; however, it was classified under Cyprinidae in the example dataset downloaded from GBIF. We will now proceed to update the Family classification of Abramis brama.

# Update taxonomic information
gbif_filtered[gbif_filtered$family == "Cyprinidae", "family"] <- "Leuciscidae"
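If several species needed updating, a species-to-family lookup built from the FishBase result would scale better than editing families one by one. The toy data below stands in for tax_info and the occurrence table, since the exact column names of tax_info are not guaranteed:

```r
# Toy example of updating a family column from a FishBase-style lookup
# (column names here are illustrative, not the exact tax_info names)
tax_info_demo <- data.frame(
  species = c("Abramis brama", "Acipenser baerii"),
  family  = c("Leuciscidae", "Acipenseridae")
)
occ_demo <- data.frame(
  species = c("Abramis brama", "Abramis brama", "Acipenser baerii"),
  family  = c("Cyprinidae", "Cyprinidae", "Acipenseridae")
)

# Named vector: species name -> FishBase family
fam_lookup <- setNames(tax_info_demo$family, tax_info_demo$species)

# Replace every family with the FishBase value for that species
occ_demo$family <- unname(fam_lookup[occ_demo$species])
```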

Step 8: Add administrative information

Administrative information, such as country name and lower administrative levels (e.g., administrative levels 1 and 2 from the Database of Global Administrative Areas), can be added by performing a spatial join between points and GADM Administrative Boundaries using the spatial_join_gadm() function. If a vector layer is not supplied but an ISO3 country code is provided, the function attempts to download GPKG files from GADM and use them for the spatial join. If the download fails due to a slow connection, the timeout argument value can be increased to allow more time for the process. As a final step, the columns are renamed to align with Darwin Core standards. In this case, GADM administrative levels 1 and 2 correspond to the Darwin Core terms stateProvince and county, respectively. Therefore, we will rename the columns accordingly.

# Perform a spatial join of points with GADM Administrative Boundaries
admin_4_gbif <- spatial_join_gadm(
  coords_df = gbif_filtered,
  country_codes = c("ALB", "AUT", "BIH", "BGR", "CHE", "CZE", "DEU", "HUN",
                    "ITA", "HRV", "MDA", "MKD", "MNE", "POL", "ROU", "SRB",
                    "SVK", "SVN", "UKR", "XKO"),
  lat_col = "decimalLatitude",
  lon_col = "decimalLongitude",
  timeout = 1000)

# Rename columns to align with Darwin Core terms
admin_4_gbif <- admin_4_gbif %>%
  select(-stateProvince) %>% # delete original column
  rename(country = COUNTRY,
         county = NAME_2,
         stateProvince = NAME_1)

Step 9: Extracting sub-catchment IDs

Next, we will extract the IDs of the sub-catchments where each occurrence is located and add a new column with these IDs to the dataset. To achieve this, we will use the extract_ids() function from the hydrographr package. The sub-catchment layer was generated from the Hydrography90m dataset by downloading the tiles corresponding to the Danube River Basin and cropping them using functions from the hydrographr package.

Let's first download, crop, and merge the raster tiles with sub-catchments.


# Folder to save final Hydrography90m layers
# Create output folder if it doesn't exist
output_hydro90m <- paste0(getwd(), "/hydrography90m")
if(!dir.exists(output_hydro90m)) dir.create(output_hydro90m)

# Folder to save temporary files
temp <- paste0(getwd(), "/temp")
if(!dir.exists(temp)) dir.create(temp)

# Define tile ID
id <- c("h18v02", "h18v04", "h20v02", "h20v04")

# Define raster variable
r_var <- c("sub_catchment")

# Download raster tiles
download_tiles(variable = r_var,
               file_format = "tif",
               tile_id = id,
               download_dir = temp)


# Get the full paths of the raster tiles
raster_dir <- list.files(paste0(temp, "/r.watershed"), pattern = "\\.tif$",
                         full.names = TRUE, recursive = TRUE)

# Crop raster tiles to the extent of an sf object
for(itile in raster_dir) {

  # Use base R's sub() to strip the extension, so the stringr package
  # is not required
  crop_to_extent(raster_layer = itile,
                 bounding_box = bbox_danube_basin,
                 out_dir = temp,
                 file_name = paste0(sub("\\.tif$", "", basename(itile)),
                                    "_tmp.tif"))

}

# Merge filtered raster tiles and save result to a directory
merge_tiles(tile_dir = temp,
            tile_names = list.files(temp,
                                    full.names = FALSE,
                                    pattern = "_tmp.tif"),
            out_dir = output_hydro90m,
            file_name = "subcatch_danube.tif",
            name = "ID",
            read = FALSE,
            compression = "high",
            quiet = FALSE)

Now, we are ready to extract the sub-catchment IDs.

# Create unique IDs for the rows of admin_4_gbif. This is required by the
# function extract_ids() and the column will be removed later.
admin_4_gbif$id_col <- seq_len(nrow(admin_4_gbif))

# Extract sub-catchment IDs 
ids_df <- extract_ids(
    data = admin_4_gbif,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    id = "id_col",
    basin_layer = NULL,
    subc_layer = paste0(getwd(), "/hydrography90m/subcatch_danube.tif"),
    quiet = FALSE
  ) 

# Check that all occurrences have sub-catchment IDs
n_missing <- sum(is.na(ids_df$subcatchment_id))
if(n_missing == 0) {
  print("All occurrences have sub-catchment IDs.")
} else {
  print(paste0(n_missing, " occurrences are missing sub-catchment IDs."))
}

# New column with sub-catchment IDs
gbif_subcID <- admin_4_gbif %>%
  mutate(subcatchmentID = ids_df$subcatchment_id) %>%
  select(-id_col) # delete column with row ids 
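Note that mutate() above assumes extract_ids() returns rows in the same order as the input. When that is uncertain, aligning by the shared id column with match() is safer; the column names in this toy sketch are illustrative:

```r
# Toy illustration of order-safe ID assignment: align by the shared id column
# with match() instead of relying on row order (column names are illustrative)
occ <- data.frame(id_col = c(1, 2, 3), species = c("a", "b", "c"))
ids <- data.frame(id_col = c(3, 1, 2), subcatchment_id = c(30, 10, 20))

# match() returns, for each occurrence, the matching row index in ids
occ$subcatchmentID <- ids$subcatchment_id[match(occ$id_col, ids$id_col)]
```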

Step 10: Add occurrence IDs

To ensure each record has a unique and persistent identifier, we will assign Universally Unique Identifiers (UUIDs) using the generate_global_identifier() function. This aligns with Darwin Core standards and facilitates data integration across biodiversity databases.

First, we identify records that lack an occurrenceID. For these records, we generate and assign a UUID with generate_global_identifier(). This step ensures that all occurrences have a globally unique identifier, improving traceability and interoperability with other datasets.

# Identify records that lack an occurrenceID
indx <- which(gbif_subcID$occurrenceID == "")

# Generate occurrence IDs
ids <- generate_global_identifier(n = length(indx))

# Add IDs
gbif_subcID[indx, "occurrenceID"] <- ids
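For illustration, a minimal version-4 UUID generator can be written in base R; the actual implementation of generate_global_identifier() may differ:

```r
# Minimal version-4 UUID sketch in base R (illustrative only; the package's
# generate_global_identifier() may use a different implementation)
make_uuid4 <- function(n = 1) {
  hex <- c(0:9, letters[1:6])
  vapply(seq_len(n), function(i) {
    x <- sample(hex, 32, replace = TRUE)
    x[13] <- "4"                                # version field is always 4
    x[17] <- sample(c("8", "9", "a", "b"), 1)   # variant field
    # Assemble the 8-4-4-4-12 hyphenated format
    paste0(
      paste(x[1:8],   collapse = ""), "-",
      paste(x[9:12],  collapse = ""), "-",
      paste(x[13:16], collapse = ""), "-",
      paste(x[17:20], collapse = ""), "-",
      paste(x[21:32], collapse = "")
    )
  }, character(1))
}
```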

Step 11: Check data format and completeness

To ensure data consistency and compliance with Darwin Core standards, we will perform a series of format checks.

  • Column names are verified using the check_column_name() function to confirm that they conform to Darwin Core terms as well as additional custom names.

  • Year values are checked for missing entries and validated to be numeric and within an acceptable range using the check_year_column() function. The function returns a list with:
      ◦ missing_values: indices of missing values in the year column.
      ◦ invalid_years: indices of values that fall outside the valid year range.
      ◦ updated_df: a data frame with the updated year values and a flag marking invalid entries.

  • Coordinates are validated using the check_coordinates() function to ensure that latitude and longitude values follow the correct decimal degree format (WGS84).

These checks help maintain data quality and facilitate seamless integration with biodiversity databases. Finally, to ensure data completeness we will verify that each record includes at least a species name, date, and coordinates.

# Check column names
check_column_name(df_input = gbif_subcID,
                  standard_terms = c(dwc_names, "waterBodyType", "gbifID",
                                     "subcatchmentID"),
                  verbose = TRUE)

# Check for missing values and ensure that year values are numeric and fall
# within a valid range. The result is a list with three elements:
# "missing_values", "invalid_years" and "updated_df".
gbif_year_checked <- check_year_column(df = gbif_subcID,
                                       col_year = "year",
                                       year_range = c(1800, 2024))
print(gbif_year_checked$missing_values) 
print(gbif_year_checked$invalid_years)

# Check coordinates format and ensure they are numeric data type
gbif_coord_checked <- check_coordinates(df = gbif_year_checked$updated_df,
                                        lat_col = "decimalLatitude",
                                        lon_col = "decimalLongitude",
                                        convert_to_numeric = TRUE,
                                        verbose = TRUE)

# Check data completeness: verify that each record includes at least a
# species name, a year, and coordinates (a simple base-R check; we assume
# check_coordinates() returns the checked data frame)
required_cols <- c("species", "year", "decimalLatitude", "decimalLongitude")
incomplete <- !complete.cases(gbif_coord_checked[, required_cols])
sum(incomplete)

Conclusion

This vignette shows how to download and clean fish species occurrence data from GBIF for the Danube River Basin using the danubeoccurR package. The cleaned data can be further analyzed or used for conservation strategies. This pipeline can also be integrated into broader biodiversity assessment workflows.

Acknowledgements

Funding was provided through the DANUBE4all project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement no. 101093985.